Experimenting N-Grams in Text Categorization

نویسندگان

  • Abdellatif Rahmoun
  • Zakaria Elberrichi
چکیده

This paper deals with automatic supervised classification of documents. The approach suggested is based on a vector representation of the documents centred not on the words but on the n-grams of characters for varying n. The effects of this method are examined in several experiments using the multivariate chi-square to reduce the dimensionality, the cosine and Kullback&Liebler distances, and two benchmark corpuses the reuters-21578 newswire articles and the 20 newsgroups data for evaluation. The evaluation was done, by using the macroaveraged F1 function. The results show the effectiveness of this approach compared to the Bag-OfWord and stem representations.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Serbian Text Categorization Using Byte Level n-Grams

This paper presents the results of classifying Serbian text documents using the byte-level n-gram based frequency statistics technique, employing four different dissimilarity measures. Results show that the byte-level n-grams text categorization, although very simple and language independent, achieves very good accuracy.

متن کامل

A Biomimetic Approach Based on Immune Systems for Classification of Unstructured Data

In this paper we present the results of unstructured data clustering in this case a textual data from Reuters 21578 corpus with a new biomimetic approach using immune system. Before experimenting our immune system, we digitalized textual data by the n-grams approach. The novelty lies on hybridization of n-grams and immune systems for clustering. The experimental results show that the recommende...

متن کامل

Language-independent text categorization by word N-gram using an automatic acquisition of words

We previously proposed the accumulation method, a language-independent text classification method that is based on character N-grams. The accumulation method does not depend on the language structure because this method uses character N-grams to form

متن کامل

A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization

In this work we investigate the usefulness of n-grams for document indexing in text categorization (TC). We call n-gram a set gk of n word stems, and we say that gk occurs in a document dj when a sequence of words appears in dj that, after stop word removal and stemming, consists exactly of the n stems in gk, in some order. Previous researches have investigated the use of n-grams (or some varia...

متن کامل

A Study Using n-gram Features for Text Categorization

In this paper, we study the effect of using n-grams (sequences of words of length n) for text categorization. We use an efficient algorithm for generating such n-gram features in two benchmark domains, the 20 newsgroups data set and 21,578 REUTERS newswire articles. Our results with the rule learning algorithm R IPPER indicate that, after the removal of stop words, word sequences of length 2 or...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Int. Arab J. Inf. Technol.

دوره 4  شماره 

صفحات  -

تاریخ انتشار 2007